Abstract

Tree transducers are formal automata that 
transform trees into other trees. Many va- 
rieties of tree transducers have been ex- 
plored in the automata theory literature, 
and more recently, in the machine trans- 
lation literature. In this paper I review T 
and xT transducers, situate them among 
related formalisms, and show how they 
can be used to implement rules for ma- 
chine translation systems that cover all of 
the cross-language structural divergences 
described in Bonnie Dorr's influential arti- 
cle on the topic. I also present an imple- 
mentation of xT transduction, suitable and 
convenient for experimenting with transla- 
tion rules. 

1 Introduction 

Word-based approaches to statistical machine
translation, starting with the work from IBM in
the early 1990s (Brown et al., 1993), have been
successful both in production translation
systems and in invigorating MT research. Since
then, newer phrase-based MT techniques such 
as the alignment template model (Och and Ney, 
2004), and hierarchical phrase-based models (Chi- 
ang, 2005) have made significant improvements in 
SMT translation quality. 

Despite their sophistication and apparent com- 
plexity, many word-based and phrase-based SMT 
models can be implemented entirely in terms of 
finite-state transducers. This allows researchers to 
make use of the rich automata literature for find- 
ing clean and efficient algorithms; it is also use- 
ful from a software engineering perspective, mak- 
ing it possible to do experiments quickly, using 
generic toolkits for programmatically manipulat- 
ing finite-state transducers. Several such packages 
are freely available, such as OpenFST (Allauzen 



et al., 2007) and the RWTH FSA Toolkit (Kanthak
and Ney, 2004). 

However, since they make no attempt to explic- 
itly model the syntax of either involved language, 
and typically use simple n-gram models to guide 
generation, the output of word-based SMT sys- 
tems can be syntactically incoherent, especially in 
light of long-distance dependencies. Additionally, 
word-based SMT models have difficulties encod- 
ing word order differences across languages. 

This has motivated new methods in MT that
explicitly model syntax, in which the grammar
of a language, and the relationships between
the grammars of two languages, can typically be
learned from treebanks. There are many
theoretical frameworks available for describing
syntax and transformations over syntactic
representations. Both from a theoretical standpoint, and
as MT implementors, we would like a framework 
that is clean and general, and is suitably expressive 
for explicitly capturing syntactic structures and the 
divergences across languages. We would also like 
one for which there are efficient algorithms for 
training rules and performing transduction (i.e., 
decoding at translation time), and ideally one for 
which a good software toolkit is freely available. 

Not all syntactic relationships can be cleanly 
represented with every syntactic formalism; each 
formalism has its own expressive power. Bon- 
nie Dorr provides us with an excellent test bed 
of seven cross-language divergences that may oc- 
cur when we want to perform translation, even 
between languages as closely related as English, 
Spanish and German (Dorr, 1994). While these 
divergences do not totally describe the ways in 
which languages can differ in their typical descrip- 
tions of an event, they provide a concrete starting 
point, and are easily accessible. 

In this paper, I specifically investigate T and 
xT transducers, situate them in the space of for- 
malisms for describing syntax-based translation, 



and demonstrate that xT transducers are sufficient 
for modeling all of the syntactic divergences iden- 
tified by Dorr. I also present kurt, a small soft- 
ware toolkit for experimenting with these trans- 
ducers, which comes with sample translation rules 
that handle each of Dorr's divergences. 

In the rest of this paper, we will discuss some 
relevant grammar and transducer formalisms, in- 
cluding a more in-depth look at T and xT trans- 
ducers; go through the linguistic divergences dis- 
cussed by Dorr and explain why they might cause 
difficulties for MT systems; show how xT trans- 
ducers can be used to address each of these diver- 
gences; present the software that I have built; re- 
view some of the related work that has informed 
this paper; and finally, suggest future possible di- 
rections for work with tree transducer-based MT. 

2 Grammars and Transducers 

Here we contrast several kinds of formalisms over 
strings, trees, and pairs of strings and trees; please 
see Figure 2 for a glossary of different kinds of 
automata and grammars that will be referenced in 
this paper. A grammar describes a single set of 
strings or trees, and consists of a finite set of rules 
that describes those strings or trees. Familiar for- 
malisms for grammars that describe sets of strings 
include context-free grammars and the other mem- 
bers of the Chomsky Hierarchy. Some grammars 
describe sets of trees, and these will be the main 
focus of the rest of this paper; when discussing 
grammars over strings, I will specifically mention 
it. For example, regular tree grammars (RTGs) are
the class of grammars corresponding to
context-free grammars but describing trees; they
describe sets of trees whose yields (the string
concatenation of the symbols at the leaves) form
a context-free language (Knight and Graehl, 2005).
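To make the notion of a yield concrete, here is a minimal Python sketch. The nested-tuple encoding of trees and the function name tree_yield are assumptions made for illustration; they are not taken from any toolkit discussed in this paper.

```python
# A tree is either a leaf (a string) or a tuple (label, child, child, ...).
# This encoding is an illustrative assumption, not a standard API.

def tree_yield(tree):
    """Return the list of leaf symbols of a tree, read left to right."""
    if isinstance(tree, str):
        return [tree]
    leaves = []
    for child in tree[1:]:  # position 0 holds the node label
        leaves.extend(tree_yield(child))
    return leaves


example = ("S", ("NP", ("PRP", "I")), ("VP", ("VB", "am"), ("JJ", "hungry")))
print(" ".join(tree_yield(example)))
```

An RTG constrains which trees of this shape are licensed; the yields of the licensed trees, taken together, are what form the context-free string language.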

In contrast, synchronous grammars describe
sets of pairs of objects; here again, we are mostly 
concerned with synchronous grammars that de- 
scribe trees. Formally, a synchronous grammar 
over trees establishes a mathematical relation over 
two sets of trees, and allows us to answer the 
question of whether, for a given pair of trees, 
that pair is in the relation. The production rules
of a synchronous grammar do not just describe
one language, but have pairs of production rules
$\langle r_1, r_2 \rangle$, such that when $r_1$ is used to derive
a string in language $L_1$, $r_2$ must be used in the
derivation of a string in $L_2$.
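The pairing of rules can be sketched with a toy synchronous grammar over strings. Everything below is an illustrative assumption: the rule table, the invented verb-final target "language" written in capitals, and the simplification that each nonterminal has exactly one rule pair.

```python
# Each nonterminal maps to one rule pair (source rhs, target rhs).
# A linked nonterminal is written as (symbol, index); the same index on both
# sides means the two occurrences are rewritten together by one rule pair.
RULES = {
    "S": ([("SUBJ", 0), ("VP", 1)], [("SUBJ", 0), ("VP", 1)]),
    "VP": (["sees", ("OBJ", 0)], [("OBJ", 0), "SEES"]),  # verb moves to the end
    "SUBJ": (["John"], ["JOHN"]),
    "OBJ": (["Mary"], ["MARY"]),
}


def derive(symbol):
    """Expand `symbol`, returning the (source words, target words) pair."""
    src_rhs, tgt_rhs = RULES[symbol]
    # Expand every linked nonterminal once, so both sides share the result.
    expansions = {}
    for item in src_rhs:
        if isinstance(item, tuple):
            expansions[item[1]] = derive(item[0])

    def realize(rhs, side):
        words = []
        for item in rhs:
            if isinstance(item, tuple):
                words.extend(expansions[item[1]][side])
            else:
                words.append(item)
        return words

    return realize(src_rhs, 0), realize(tgt_rhs, 1)


source, target = derive("S")
print(" ".join(source), "|", " ".join(target))
```

Because the two halves of a rule pair are always applied together, the source and target strings are derived in lockstep, even though the VP rule orders its children differently on the two sides.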



Thus synchronous grammars can be used for
several kinds of tasks, such as parsing parallel
texts, generating parallel text, or, most usefully
in a machine translation setting, parsing text in
one language while jointly generating parse
trees that yield text in the other. All of these
operations are described for synchronous
context-free grammars in David Chiang's tutorial
(Chiang, 2006). In his tutorial, Chiang describes some of
the limitations of using synchronous CFGs; no- 
tably, they cannot rearrange parts of parse trees 
that are not sisters. Of particular interest in this 
work is raising and lowering elements; Chiang 
gives the example of swapping subjects and ob- 
jects, as in the example of translation between 
English and French in Figure 1. Chiang points 
out that, for syntax-aware MT, we would like to 
be able to use some more powerful formalism 
that can perform transformations like this. Syn- 
chronous tree substitution grammars, for exam- 
ple, are able to describe transformations of this 
form, but not the transformation from cross-serial 
dependencies in subordinate clauses in Dutch to 
the nested clause structure of English. This latter 
transformation would require more formal power, 
which is offered by tree-adjoining grammars. 

2.1 TAG and Related Formalisms 

Tree adjoining grammar, introduced by Joshi 
(Joshi et al., 1975), has been a popular formalism 
for describing grammars over trees. It provides ad- 
ditional expressive power not available in regular 
tree grammars, handling some, but not all,
context-sensitive languages. TAG can cleanly describe
many of the non-context-free features observed in
human languages, such as the cross-serial
dependencies in Dutch. TAG is thus called "mildly
context-sensitive", and has been shown to be formally
equivalent to several other syntactic formalisms,
such as Combinatory Categorial Grammar (CCG) 
and Linear Indexed Grammars (Vijay-Shanker and 
Weir, 1994). 

The operations of TAG are substitution and ad- 
junction, which combine the two different kinds 
of elementary trees present in a given TAG gram- 
mar, initial trees and auxiliary trees. The substi- 
tution operation takes two trees, one with a leaf 
that is an unresolved nonterminal a, and produces 
a new tree in which that node has been replaced 
with an entire subtree (copied from another initial 
tree in the grammar, or one that has already been 
Figure 1: Switching subject and object in translation to French. Example from (Chiang, 2006). Also
note the structural difference: there is a PP subtree in French, not present in English. 



derived) whose root node is also a. For example, 
an initial tree may have an unresolved nontermi- 
nal that wants to have an NP attached to it (it has 
a leaf labelled NP); the substitution operation at- 
taches an existing subtree whose root is NP, pro- 
ducing a new tree where that nonterminal is now 
resolved. The adjunction operation takes an exist- 
ing tree and an auxiliary tree, which has a special 
node marked as the "foot", and grafts the auxiliary 
tree in place in the middle of the existing tree, at- 
taching the tree material at the target location to 
the foot node of the auxiliary tree. For a very clear 
tutorial on TAG with good examples, please see 
(van Noord, 1993), Section 4.2.4. Synchronous 
TAG has also been investigated, and its use in ma- 
chine translation has been advocated by Shieber, 
who argues that its expressive power may make up 
for its computational complexity (Shieber, 2007). 
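The two TAG operations can be sketched on small trees as follows. This is a hedged illustration under my own conventions: trees are nested tuples, a leaf ending in "!" marks a substitution site, a leaf ending in "*" marks the foot node of an auxiliary tree, and the function names are hypothetical.

```python
SUBST = "!"  # suffix marking a substitution site, e.g. "NP!"
FOOT = "*"   # suffix marking the foot node of an auxiliary tree, e.g. "VP*"


def substitute(tree, initial):
    """Fill every substitution site whose label matches the root of `initial`."""
    site = initial[0] + SUBST
    if isinstance(tree, str):
        return initial if tree == site else tree
    return (tree[0],) + tuple(substitute(c, initial) for c in tree[1:])


def plug_foot(aux, subtree):
    """Return `aux` with its foot node replaced by `subtree`."""
    foot = subtree[0] + FOOT
    if isinstance(aux, str):
        return subtree if aux == foot else aux
    return (aux[0],) + tuple(plug_foot(c, subtree) for c in aux[1:])


def adjoin(tree, path, aux):
    """Adjoin `aux` at the internal node reached via 0-indexed child `path`."""
    if not path:
        assert aux[0] == tree[0], "auxiliary root must match the target node"
        return plug_foot(aux, tree)
    i = path[0] + 1  # children start at position 1 of the tuple
    return tree[:i] + (adjoin(tree[i], path[1:], aux),) + tree[i + 1:]


alpha = ("S", "NP!", ("VP", ("V", "sleeps")))   # initial tree with an open NP
beta = ("VP", ("ADV", "often"), "VP*")          # auxiliary tree with a VP foot
derived = substitute(alpha, ("NP", "John"))
print(adjoin(derived, [1], beta))
```

Substitution only ever fills leaves, whereas adjunction splices new material into the middle of a tree and re-attaches the displaced subtree at the foot.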
Restricted versions of TAG and their syn- 
chronous analogues have also been investigated. 
These do not provide the full expressive power 
of TAG, but can be parsed and trained more ef- 
ficiently. The two limited versions of TAG that 
are most prominently discussed in the literature 
are tree substitution grammars (TSG) and tree in- 
sertion grammars (TIG). TSG only provides the 
substitution operation, and does not have auxiliary 
trees or adjunction (Eisner, 2003). TIG, on the 
other hand, includes both the substitution and ad- 
junction operations, but places constraints on the 
permissible shapes of auxiliary trees: their foot 
nodes must be at the leftmost or rightmost edge 
of the frontier, and a given derivation may not ad- 
join "left" auxiliary trees into "right" ones, or vice- 
versa. These restrictions are sufficient to limit the 
weak generative capacity of TIGs to that of CFGs, 
but they also ensure that algorithms on TIGs can 
run more efficiently. While parsing with a TAG
takes $O(n^6)$ time in the general case, TIG
(like the general case for CFGs) can be parsed
in $O(n^3)$ (Nesson et al., 2006). Both STIG and
STSG have seen use in machine translation; for 
example, probabilistic STIG is used in (Nesson 
et al., 2006), and STSG has been notably used in 
(Eisner, 2003). 

2.2 Tree Transducers 

While synchronous grammars provide a declara- 
tive description of a relation that holds between 
two sets of trees, tree transducers more explicitly 
describe the process by which a tree may be con- 
verted into other trees. Like finite-state transduc- 
ers, which operate over strings, tree transducers 
typically describe nondeterministic processes, so 
for a given input tree, there is a set of possible out- 
put trees; that set may (for example) be described 
by a regular tree grammar. 

Tree transducers and synchronous grammars 
both describe mathematical relations over trees, 
so we can sensibly ask about their compara- 
tive formal expressive power, and use them to 
compute similar queries. For example, with ei- 
ther a synchronous grammar or a transducer, we 
may ask, for a given tree, what are the other 
trees that are in the mathematical relation with 
it (Chiang, 2006). There are transducer vari- 
eties with the same formal expressive power as 
certain synchronous grammars. For example, 
synchronous tree substitution grammars (STSG) 
have the same formal power as xLNT transducers 
(Maletti, 2010b), which will be described in more 
detail in the next section. 

While there are very many kinds of possible 
tree transducers, the ones used in NLP applica- 
tions typically fall into one of two classes, T trans- 
ducers, which operate "top-down", and "B" trans- 
ducers, which operate "bottom-up". 



synchronous grammar: a grammar over two languages simultaneously, where rules are given in 
pairs and must be used together 

probabilistic grammar: a grammar where rules have associated weights, which defines a probability 
distribution over derivations licensed by that grammar 

TAG: tree adjoining grammar, mildly context-sensitive grammar formalism over trees, with substi- 
tution and adjunction operations 

TIG: tree insertion grammar: a TAG wherein rules have certain restrictions, described in Section 
2.1. 

TSG: tree substitution grammar; similar to TAG without the adjoining operation 

RTG: regular tree grammar, the tree analogue of context-free grammar 

finite-state transducers: transducers over strings; finite-state automata with the added ability to 
produce output 

tree transducers: automata that define relations over trees procedurally 

T transducers: "top down" tree transducers 

R transducers: the same as T transducers, a name used in earlier work. "R" stands for "Root to 
Frontier" 

(L)T transducers: "linear" T transducers, constrained such that their rules are non-copying, and a 
variable appearing on the left-hand side of a rule must appear at most once in the right-hand side 

(N)T transducers: "nondeleting" T transducers, constrained such that a variable appearing on the 
left-hand side of a rule must appear at least once in the right-hand side 

(x)T transducers: T transducers with "extended" pattern matching, allowing for complex finite pat- 
terns to appear in the left-hand side of rules. 



Figure 2: Glossary of transducers and automata 



3 T Transducers 

Let us now describe T transducers in more detail. 
T transducers transform trees into other trees via 
sets of production rules. Many production rules 
may apply at a given step in a derivation, so the 
transductions are usually nondeterministic, relat- 
ing a given input tree to many possible output 
trees. Thus a T transducer, like a synchronous tree 
grammar, defines a relation over sets of trees. 

Intuitively, transduction begins with an input
tree whose root node is in the initial state $q_0$.
Each node in a tree may be in one of the states in
Q (the set of possible states), or in no state at all.
Transduction proceeds by finding all the transduc- 
tion rules that can apply to an existing tree, or sub- 
trees of an existing tree. A rule applies when the 
root of its left-hand side matches a node in the tree, 
and the state of the node matches the state of the 
rule. When a rule matches a subtree (call it t), then 
a new tree is produced and added to the set of cur- 
rent trees by replacing the subtree that matched the 
rule with the right-hand side of the rule, save that 
the variables in the right-hand side of the rule have 
been replaced by the corresponding subtrees of t. 
Additionally, a rule may specify that subtrees of 
the new tree being produced should be in states as 
well, indicating that more transduction work must 
be done on them before the derivation is finished. 
A complete, successful transduction in a T trans- 
ducer begins with the root node being in the initial 
state, then states propagating down the tree to the 
leaves, until the entire tree has been transduced. 
See Figure 3 for an illustrative example, adapted
from (Graehl et al., 2008).

As in other production systems, each new tree
produced by applying a rule is added to the
inventory of current trees. A transduction is complete
for a tree in the inventory when none of its nodes
remains in a state; at this point, the states
will have propagated all the way from the top of
the tree to the leaves and been resolved; in the
case of translation, the symbols in the tree will be
words in the output language. The transduction
process is nondeterministic; many rules may ap- 
ply to a given tree in the inventory, and even the 
same rule may apply to different subtrees. To do 
a complete search for all possible transductions, 
we apply each rule to every subtree where it is ap- 



plicable, and produce every possible resulting tree; 
beam search may also be done, where search paths 
with low probabilities are pruned. 
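The exhaustive search can be sketched as follows. Note the assumptions: this sketch enumerates outputs recursively rather than maintaining an explicit inventory of partial trees, the rule table is a toy of my own, and the encoding (state references written as (state, child index) pairs) is hypothetical.

```python
from itertools import product

# A rule maps (state, input symbol) to a list of (weight, output template).
# In a template, a pair (state, i) with an integer i means "child i of the
# matched node, transduced in that state"; any other tuple is an output node
# (label, part, ...), and a string is an output leaf.
RULES = {
    ("q", "S"): [(1.0, ("S", ("q", 1), ("q", 0)))],  # swap the two children
    ("q", "A"): [(0.6, "a"), (0.4, "alpha")],        # two competing outputs
    ("q", "B"): [(1.0, "b")],
}


def transduce(state, tree):
    """Return every (weight, output tree) derivable from `tree` in `state`."""
    symbol = tree if isinstance(tree, str) else tree[0]
    children = () if isinstance(tree, str) else tree[1:]
    results = []
    for weight, template in RULES.get((state, symbol), []):
        for w, out in fill(template, children):
            results.append((weight * w, out))
    return results  # empty when no rule applies, so the derivation fails


def fill(template, children):
    """Expand a template into all (weight, tree) pairs it can produce."""
    if isinstance(template, str):
        return [(1.0, template)]
    if len(template) == 2 and isinstance(template[1], int):
        state, index = template
        return transduce(state, children[index])
    label, parts = template[0], template[1:]
    options = [fill(p, children) for p in parts]
    results = []
    for combo in product(*options):  # every combination of sub-results
        weight = 1.0
        for w, _ in combo:
            weight *= w
        results.append((weight, (label,) + tuple(t for _, t in combo)))
    return results


for weight, tree in transduce("q", ("S", "A", "B")):
    print(weight, tree)
```

A beam search could be approximated in this sketch by sorting each list of results by weight and keeping only the best few before combining them, at the cost of possibly discarding the best complete derivation.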

Formally, a T transducer has the following ele- 
ments. 

• an input alphabet $\Sigma$

• an output alphabet $\Delta$

• a set of states $Q$

• an initial state, typically denoted $q_0$

• transition rules, which are tuples of the form

$(q \in Q, a \in \Sigma, \mathit{tpat}, p)$

The transition rule tuples specify the state that 
a given node must be in, and the symbol from the 
input language that the subtree must have, (state 
q and symbol a, respectively), in order for this 
rule to match. They also specify a tree pattern that 
forms the right-hand side of the rule, and a weight 
p for this rule. The tree pattern is a tree where 
some of the elements in the tree may be variables, 
which refer to subtrees of the left-hand side under 
consideration. 
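The elements above map naturally onto a small data structure. The following is a sketch only; the class names, field names, and the well-formedness check are assumptions of mine rather than part of any existing implementation.

```python
from typing import Any, NamedTuple


class Rule(NamedTuple):
    state: str     # q, the state the matched node must be in
    symbol: str    # a, the input symbol at the matched node
    tpat: Any      # right-hand-side tree pattern, possibly with variables
    weight: float  # p, the weight of the rule


class Transducer(NamedTuple):
    sigma: frozenset   # input alphabet
    delta: frozenset   # output alphabet
    states: frozenset  # Q
    initial: str       # the initial state
    rules: tuple       # the transition rules


def is_well_formed(t):
    """Check that the initial state and all rules use declared states/symbols."""
    return t.initial in t.states and all(
        r.state in t.states and r.symbol in t.sigma for r in t.rules
    )


toy = Transducer(
    sigma=frozenset({"S", "A"}),
    delta=frozenset({"S", "a"}),
    states=frozenset({"q"}),
    initial="q",
    rules=(Rule("q", "A", "a", 1.0),),
)
print(is_well_formed(toy))
```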

3.1 xT Transducers 

The "extended" variation of T transducers, indi- 
cated with an "x" prefix, adds the capability for 
rules to check whether a potentially matching sub- 
tree matches a certain pattern of finite size, in ad- 
dition to the given state and value of the node. The 
tree pattern in the left-hand side of an xT transduc- 
tion rule may contain literal symbols as well as 
variables, which allows for lexicalized rules that 
only apply when certain words are in a subtree. 
The tree patterns also make it possible for the rules 
to reference material finitely far into a subtree, 
which makes local rotations straightforward; see 
Figure 4 for example xT rules that perform a local 
rotation and also use finite lookahead to produce 
Francophone names. In the notation common in 
the literature, a state for a node is written next to 
that node in the tree structure. 
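Extended left-hand-side patterns amount to unifying a finite tree pattern with a subtree. The sketch below shows such matching and a local rotation in the spirit of Figure 1; it is my own reconstruction, not the rules from Figure 4. The nested-tuple encoding, the "?"-prefixed variable convention, and the assumption that each variable occurs once in the left-hand side are all illustrative, and states are omitted for brevity.

```python
def match(pattern, tree, bindings=None):
    """Unify `pattern` with `tree`; return variable bindings, or None."""
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, str):
        if pattern.startswith("?"):  # a variable matches any subtree
            bindings[pattern] = tree
            return bindings
        return bindings if pattern == tree else None
    if isinstance(tree, str) or len(pattern) != len(tree):
        return None
    for p, t in zip(pattern, tree):
        if match(p, t, bindings) is None:
            return None
    return bindings


def instantiate(template, bindings):
    """Copy `template`, replacing each variable with its bound subtree."""
    if isinstance(template, str):
        return bindings.get(template, template)
    return tuple(instantiate(t, bindings) for t in template)


# The literal "misses" lexicalizes the rule; ?subj and ?obj swap places, and
# the former subject is lowered into a new PP on the output side.
LHS = ("S", "?subj", ("VP", ("V", "misses"), "?obj"))
RHS = ("S", "?obj", ("VP", ("V", "manque"), ("PP", ("P", "à"), "?subj")))

english = ("S", ("NP", "John"), ("VP", ("V", "misses"), ("NP", "Mary")))
bindings = match(LHS, english)
print(instantiate(RHS, bindings))
```

Because the pattern reaches two levels into the tree, a single rule can move the object above the verb phrase, which a plain T rule matching only the root symbol cannot do in one step.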

While T transducers are not as expressive as 
synchronous TSG (Shieber, 2004), xT transducers 
are as expressive, and can even be used to simulate 
synchronous TAG in some cases (Maletti, 2010a). 
In addition to their formal expressive power, xT 
transducers are much more convenient for rule au- 
thors; some finite lookahead can be simulated with 
the standard T transducers, as shown in (Graehl 
Figure 3: Example transduction steps, simplified from (Graehl et al., 2008). Note that this transduction
is not complete because the node with the symbol "C" is still in the state q.
Figure 4: Switching subject and object in translation to French with an xT transducer rule. The state q2
here indicates that we want to translate names as well.



et al., 2008), but it is somewhat tedious. The use 
of xT transducers makes writing rules to rearrange 
material in a tree much more convenient. 

3.2 Restricted Versions of T and xT 
Transducers 

For computational efficiency purposes, we may 
also consider placing certain restrictions on the 
rules in a T or xT transducer. Options that have 
been explored include requiring that a transducer 
be linear, which means that any variable occur- 
ring in the left-hand side of a rule should appear 
no more than once in the right-hand side, and non- 
deleting, which means that a variable in the left- 
hand side must appear at least once in the right- 
hand side. Linear transducers are given the prefix 
"L", and nondeleting transducers the prefix "N", 
so for example, extended linear non-deleting top- 
down transducers are described as "xLNT". This 
particular combination of options has been used 
several times in the literature, including (Galley et 
al., 2004). Also note that the transducer in Figure 
3 is not nondeleting, since Rule 2 does not refer- 
ence its variables in its right-hand side. 
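Both restrictions are simple properties of how often each left-hand-side variable occurs in the right-hand side, so they are easy to check mechanically. The following sketch assumes nested-tuple patterns with "?"-prefixed variables; the function names are hypothetical.

```python
from collections import Counter


def variables(pattern):
    """Count occurrences of each variable (a "?"-prefixed string) in a pattern."""
    if isinstance(pattern, str):
        return Counter([pattern]) if pattern.startswith("?") else Counter()
    counts = Counter()
    for part in pattern:
        counts += variables(part)
    return counts


def is_linear(lhs, rhs):
    """Linear: no left-hand-side variable is copied in the right-hand side."""
    rhs_counts = variables(rhs)
    return all(rhs_counts[v] <= 1 for v in variables(lhs))


def is_nondeleting(lhs, rhs):
    """Nondeleting: every left-hand-side variable is used in the right-hand side."""
    rhs_counts = variables(rhs)
    return all(rhs_counts[v] >= 1 for v in variables(lhs))


swap = (("S", "?x", "?y"), ("S", "?y", "?x"))
copy = (("A", "?x"), ("B", "?x", "?x"))
drop = (("A", "?x", "?y"), ("B", "?x"))
for name, (lhs, rhs) in [("swap", swap), ("copy", copy), ("drop", drop)]:
    print(name, is_linear(lhs, rhs), is_nondeleting(lhs, rhs))
```

A transducer is in the LN class only if every one of its rules passes both checks.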

One benefit of adding these constraints on rules
is that LT and LNT transducers are
compositional, meaning that a relation that can
be expressed by a cascade of two LT transducers
can also be expressed by a single LT transducer,
and that the composition of those two
transducers can be computed. However, this is not possible



with any other members of the T-transducer fam- 
ily; even xLNT transducers are non-compositional 
(Knight, 2007). 

4 Linguistic Divergences 

Bonnie Dorr, in (Dorr, 1994), enumerates several 
different kinds of structural divergences that we 
might see in translation between languages. These 
divergences occur when translating from English 
to closely related languages, Spanish and German, 
all of which have fairly similar word orders. These 
are not the only kinds of syntactic differences that 
there can be in a translation. They do not, for ex- 
ample, cover the more large-scale reorderings that 
we see when translating between SVO and SOV 
or VSO languages. However, each of these divergences
requires something more than simple word
substitution or reordering the children of a given 
node: many of these require raising and lowering 
tree material (performing "rotations", in the termi- 
nology of (Shieber, 2004)), and nested phrases that 
are present in one language are often not present in 
the other. Many of these divergences may appear 
in a given pair of translated sentences. The fol- 
lowing subsections describe Dorr's seven kinds of 
divergence. 

4.1 Thematic Divergence 

Different languages may express a situation by as- 
signing different thematic roles to the participants 
of an action, swapping (for example) the subject 



and object. For example, translating from English 
to Spanish, we see: 

• I like Mary 

• María me gusta a mí

In Spanish it is more common to say that "X 
pleases Y" than that "Y wants/likes X". The Span- 
ish verb querer has the same structure as the En- 
glish "like", but the meaning of "gustar" is closer 
to the English "to like". 

4.2 Promotional Divergence 

A modifier in one language may be the head in 
another language. 

• John usually goes home 

• Juan suele ir a casa 

Here in English, an adverb modifies the verb 
to indicate that it is habitual, but in Spanish we 
use the verb "soler" (which inflects as "suele" for 
third-person singular), to express this. It has an 
infinitive as a dependent. 

4.3 Demotional Divergence 

The demotional divergence is similar to a
promotional divergence, viewed in the other direction; in
cases of demotional divergence, a head in one
language is a modifier in the other. In (Dorr, 1994),
a formal distinction is made between the two be- 
cause in Dorr's MT system, they would be trig- 
gered in different circumstances, but for our pur- 
poses they are effectively analogous. 

• I like eating 

• Ich esse gern 

In this example, while English uses the verb "to 
like", German has an adverb. The sentence has a 
literal translation of "I eat likingly". 

4.4 Categorial Divergence 

In cases of categorial divergence, the meaning of a 
word with a certain part of speech in one language 
is expressed with a different part of speech in the 
other. 

• I am hungry 

• Ich habe Hunger. 

The German sentence here translates literally as 
"I have hunger." 



4.5 Structural Divergence 

In cases of structural divergence, there are phrases 
in one language not present in the other. 

• John entered the house 

• Juan entró en la casa

While the English sentence has the destination 
of the motion verb as an object, in the Spanish we 
see the prepositional phrase "en la casa" ("in the 
house"). 

4.6 Lexical Divergence 

In cases of lexical divergence, the two languages 
involved have different idiomatic phrases for de- 
scribing a situation. 

• John broke into the room 

• Juan forzó la entrada al cuarto

While "break into" is a phrasal verb in English, 
in Spanish it is more idiomatic to "force entry 
to". This example also includes a structural di- 
vergence, as "al cuarto" is a prepositional phrase 
not present in the English. 

4.7 Conflational Divergence 

The meaning of the sentence may be distributed to 
different words in a different language; the mean- 
ing of a verb, for example, may be carried by a 
verb and its object after translation. 

• I stabbed John 

• Yo le di puñaladas a Juan

Here the Spanish sentence means literally "I 
gave John knife wounds". The words "le" and 
"a" are both required, but for different reasons: the 
verb "dar" (to give) requires the personal pronoun 
beforehand, and whenever a human being is the 
object of a verb in Spanish, we add the "personal 
a" beforehand. 

5 Implementation 

In the course of this project, I have produced a 
small, easily-understandable toolkit named kurt 
(the Keen Utility for Rewriting Trees), for experi- 
menting with weighted xT tree transducers. It is 
implemented in Python 3 and makes use of the 
NLTK tree libraries (Bird et al., 2009). kurt has
been released as free software, and is available
online (see footnote 1).

The software can perform tree transduction in 
general for weighted xT transducers: given a tree, 
it applies xT transduction rules and produces a 
list of output trees. The implementation is fairly 
naive, and proceeds as a simple production sys- 
tem. Partial solutions are matched against every 
rule in the transducer, then each matching rule is 
applied to the partial solution, producing a new 
generation of partial solutions. Eventually, the 
derivation either succeeds by producing at least 
one tree with no nodes in a state, or it fails if the 
input tree cannot be completely transduced by the 
given rules. The system returns all possible output 
trees, and the complete solutions are printed out at 
the conclusion of the program. 

The xT rules are straightforward to write, and 
are stored in YAML files. I have also provided 
example xT rules that translate the examples of di- 
vergences given by Dorr; these are described in 
more detail in Section 6. 

A complete and useful MT system based on this 
software - such that the rules and their weights 
were not completely the product of human knowl- 
edge engineering - would require the implementa- 
tion of a few more algorithms described in (Graehl 
et al., 2008), particularly their EM training
algorithm to calculate weights for a given set of
transduction rules, which depends on their transduction
algorithm that produces a more compact
representation of a transduction, an RTG. Decoding
would require beam search over tree transduction, 
or perhaps over generation using this compact 
RTG representation. Additionally, some clever al- 
gorithm for extracting tree transducer rules from 
parallel treebanks would be useful for the case 
where parallel treebanks are available; some can- 
didate techniques for this last problem are dis- 
cussed in Section 8.2. 

5.1 Using the Software 

Transducers are stored in YAML files, with one xT 
transducer per file; each rule is specified as an en- 
try in that YAML file, and contains the following 
entries. 

• state: (required) The name of the state that 
a node at the root of a subtree must be in to 
match this rule 



1 http://github.com/alexrudnick/kurt



• lhs: (required) The left-hand side of the 
rule: a tree pattern, typically with variables 
(tokens starting with ?) that must unify with 
a subtree in order for that subtree to match 
this rule 

• rhs: (required) The right-hand side of the 
rule: another tree pattern, which is filled in 
when this rule is applied. It may contain vari- 
ables, in which case all of the variables must 
also be present in the left-hand side of the 
rule. 

• newstates: (optional) Specifies the lo- 
cations of transduction states in the sub- 
tree produced by this rule. There may be 
many states specified in the new subtree. 
They are given in the form [location, 
statename], where location is a bracketed 
list that describes the path down the tree from 
the root of the subtree, with 0-indexed
children. For example, to put the second child of
the leftmost child of the root in state foo,
a rule would have a newstates member

[[0,1], foo].

• weight: (optional) The weight for this rule. 
If unspecified, it defaults to 1.0. 
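The location paths used by newstates can be illustrated with a short sketch. This is not kurt's own code: the nested-tuple tree encoding and the helper name subtree_at are assumptions, and the only thing taken from the description above is the convention of 0-indexed child positions read from the root downward.

```python
def subtree_at(tree, location):
    """Follow a list of 0-indexed child positions down from the root."""
    node = tree
    for index in location:
        node = node[index + 1]  # position 0 of each tuple holds the node label
    return node


tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VB", "sleeps")))

# A newstates entry of the form [location, statename], as described above.
location, statename = [[0, 1], "foo"]
print(statename, subtree_at(tree, location))
```

Here the path [0, 1] selects the leftmost child of the root (the NP) and then its second child (the NN), which is the node that would be placed in state foo.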

Given a file with these entries for each rule of
a transducer, say called translation.yaml, a
Python 3 program can use kurt to do tree
transductions in the following way, assuming the
libraries are all in the $PYTHONPATH or the current
working directory.

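from nltk import Tree

from loadrules import loadrules
from translate import translate

# load the xT rules from the YAML file
rules = loadrules("translation.yaml")
# current NLTK releases parse bracketed strings with Tree.fromstring
tr = Tree.fromstring("""(S (NP (PRP I))
                           (VP (VB am)
                               (JJ hungry)))""")

## print all valid transductions
translate(tr, rules)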

5.2 Simple Topicalization Example 

In Figure 5, we see a toy example of xT rules re- 
alized with the system. This is a complete running 
example that exercises many features of the
software; it translates an English sentence into
"LOLcat" Internet slang, which features more
prominent topicalization (see footnote 2). For
simplicity, the syntactic structure of the parse tree
is elided. The initial rule matches a sentence in
the initial state q, containing "let me show you
my $x_0$", and produces a new sentence where
"my $x_0$" has been moved to the front. The rule
also specifies that the (0-indexed) child of the
S node at index 1 is in the state respell. The
second rule matches the word "Pokemon" when it
is in the state respell,
replacing it with the slang spelling of "Pokemans".
The third rule is for generalization, allowing words 
other than "Pokemon" to be translated in this po- 
sition. Due to both the second and third rules ap- 
plying to the subtree, both spellings are produced 
in the output, but the translation with the slang 
spelling is given a higher weight. 

6 xT Transducers for Linguistic 
Divergences 

I wrote xT transduction rules for the software 
toolkit that handle each of Dorr's divergence ex- 
amples. Most of the work involved was con- 
structing parse trees for the source- and target- 
language sentences; I then converted the trees into 
templates for the desired trees, at which point 
they were effectively xT transduction rules. Some 
examples are included in Figures 6 and 7, but
the complete set of rules is in german.yaml
and spanish.yaml, included with the software.
Most of the transformations required to implement 
these rules are instances of local rotations, as de- 
scribed by (Shieber, 2004). 
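To give a flavor of how a parse-tree pair becomes a rule, here is a hedged, self-contained sketch of the categorial divergence example ("I am hungry" / "Ich habe Hunger"). It is not the content of german.yaml and does not use kurt; the tuple encoding, the InState marker, and the deterministic first-match strategy are simplifying assumptions of mine.

```python
from typing import NamedTuple


class InState(NamedTuple):
    """Marks a right-hand-side variable that must be transduced further."""
    state: str
    var: str


def match(pattern, tree, bindings):
    """Unify `pattern` with `tree`; return variable bindings, or None."""
    if isinstance(pattern, str):
        if pattern.startswith("?"):
            bindings[pattern] = tree
            return bindings
        return bindings if pattern == tree else None
    if isinstance(tree, str) or len(pattern) != len(tree):
        return None
    for p, t in zip(pattern, tree):
        if match(p, t, bindings) is None:
            return None
    return bindings


def transduce(state, tree, rules):
    """Rewrite `tree` top-down with the first rule that matches in `state`."""
    for rule_state, lhs, rhs in rules:
        if rule_state == state:
            bindings = match(lhs, tree, {})
            if bindings is not None:
                return build(rhs, bindings, rules)
    return None  # no rule applies, so the derivation fails


def build(template, bindings, rules):
    """Instantiate a right-hand side, recursing into state-marked variables."""
    if isinstance(template, InState):
        return transduce(template.state, bindings[template.var], rules)
    if isinstance(template, str):
        return bindings.get(template, template)
    parts = [build(t, bindings, rules) for t in template]
    return None if any(p is None for p in parts) else tuple(parts)


RULES = [
    # Categorial divergence: the adjective "hungry" becomes the noun "Hunger".
    ("q",
     ("S", "?subj", ("VP", ("VB", "am"), ("JJ", "hungry"))),
     ("S", InState("q", "?subj"), ("VP", ("VB", "habe"), ("NN", "Hunger")))),
    # Lexical rule for the subject pronoun.
    ("q", ("NP", ("PRP", "I")), ("NP", ("PRP", "Ich"))),
]

english = ("S", ("NP", ("PRP", "I")), ("VP", ("VB", "am"), ("JJ", "hungry")))
print(transduce("q", english, RULES))
```

The first rule is essentially the source parse tree on the left and the target parse tree on the right, with the subject replaced by a variable that is left in state q for further translation.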

7 Related Work 

In addition to the work on tree-based MT, some 
very sophisticated string-based MT algorithms 
have been framed in terms of finite-state trans- 
ducers. Not long after the introduction of modern 
word-based SMT, Knight and Al-Onaizan showed 
that IBM Model 3 could be expressed with a cas- 
cade of FSTs (Knight and Al-Onaizan, 1998). 
Since string transducers can be composed, de- 
coding in this case becomes one enormous beam 
search over a single state machine. Similarly, 
Shankar Kumar and William Byrne expressed the 
phrase-based alignment template model as FSTs 
(Kumar and Byrne, 2003). The last part of the 
decoding process in Chiang's hierarchical phrase- 
based model can also be described in terms of 



2 Readers may or may not be familiar with the moderately
popular catchphrase "My Pokemans, let me show you them". 



FSTs (Iglesias et al., 2009); Iglesias et al. use 
finite-state techniques to traverse a lattice of possi- 
ble translations once chart parsing with an SCFG 
has completed. 
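
The composition of string transducers mentioned above can be illustrated with a small Python sketch. It is a toy product construction under simplifying assumptions (no epsilon transitions, real-valued weights that multiply along a path), not the API of OpenFST or the construction used in the cited work; the names Arc, FST, compose and run, and the two example machines, are hypothetical.

```python
from collections import namedtuple

Arc = namedtuple("Arc", "src inp out weight dst")
FST = namedtuple("FST", "start finals arcs")


def compose(first, second):
    """Product construction: the output of `first` is fed as input to `second`."""
    arcs = [
        Arc((a.src, b.src), a.inp, b.out, a.weight * b.weight, (a.dst, b.dst))
        for a in first.arcs
        for b in second.arcs
        if a.out == b.inp            # symbols must agree in the middle
    ]
    finals = {(f, g) for f in first.finals for g in second.finals}
    return FST((first.start, second.start), finals, arcs)


def run(fst, symbols):
    """Enumerate every (output sequence, weight) pair for an input sequence."""
    results = []

    def walk(state, pos, output, weight):
        if pos == len(symbols):
            if state in fst.finals:
                results.append((output, weight))
            return
        for arc in fst.arcs:
            if arc.src == state and arc.inp == symbols[pos]:
                walk(arc.dst, pos + 1, output + [arc.out], weight * arc.weight)

    walk(fst.start, 0, [], 1.0)
    return results


# Two toy single-state machines with made-up symbols and weights.
T1 = FST(0, {0}, [Arc(0, "a", "b", 0.5, 0), Arc(0, "a", "c", 0.5, 0)])
T2 = FST(0, {0}, [Arc(0, "b", "x", 1.0, 0),
                  Arc(0, "c", "y", 0.25, 0),
                  Arc(0, "c", "z", 0.75, 0)])

cascade = compose(T1, T2)
print(run(cascade, ["a"]))
```

After composition the cascade is a single machine, so a decoder only needs to search one state space; a realistic cascade would also need epsilon handling and a pruned (beam) search rather than the exhaustive enumeration used here.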

For tutorials and related algorithms, Chiang 
provides an excellent introduction to synchronous 
grammars in (Chiang, 2006). My understanding 
of TAG was greatly aided by the TAG section in 
(van Noord, 1993); it is referenced in the TAG 
Wikipedia page. For approachable overviews of the 
applications and properties of T-family tree transducers, 
(Knight, 2007) and (Knight and Graehl, 2005) are 
very helpful. Additionally, (Graehl et al., 2008) 
contains excellent examples for understanding xT 
transduction (one of which is in this paper in sim- 
plified form, though the original example is worth 
working through and understanding fully), along 
with a set of algorithms that can be computed over 
xT transducers, including an EM procedure that 
can be used to estimate the weights for an xT 
grammar given a parallel treebank. 

8 Conclusions and Future Work 

Here I have described the "T" family of tree trans- 
ducers and situated them among the various for- 
malisms for describing relations over strings and 
trees; I have also demonstrated that xT transducers 
are sufficient for handling translation across the 
linguistic divergences described by Dorr. I have 
presented a software package suitable for experi- 
mentation with xT transducers, which comes with 
example translation rules that perform translations 
over each of the divergences. 

There remains significant work to be done on 
the topic; for example, to my knowledge, there is 
no easily available end-to-end MT system based 
on tree transducers, either commercial or open 
source. There are many more questions that I 
would like to answer; as far as I know, these are 
open problems in the field. 

8.1 Transducers, Disambiguation, and 
Language Models 

While weighted synchronous grammars and xT 
transducers provide generative models of transla- 
tion, the probabilities that they assign to a given 
rule are set ahead of time, and are not conditioned 
on features of the surrounding context. It may 
be fruitful to try using discriminative approaches 
(i.e., classifiers) to help a transducer-based MT 



## lolcat topicalization (fronting) 
- state: q 
  lhs: (S let me show you my ?x0) 
  rhs: (S my ?x0 , let me show you them) 
  newstates: 
  - [[1], respell] 

- state: respell 
  lhs: Pokemon 
  rhs: Pokemans 
  weight: 0.9 

- state: respell 
  lhs: ?x0 
  rhs: ?x0 
  weight: 0.1 



Figure 5: xT rules for translating into LOLcat dialect, which features topicalization, in the YAML format 
used by the software implemented as part of this work 



# handle <pronoun> like <gerund> 
- state: q 
  lhs: (S (NP ?x0) (VP (VB like) ?x1)) 
  rhs: (S (NP ?x0) (VP ?x1 (RB gern))) 
  newstates: 
  - [[0, 0], lookup] 
  - [[1, 0], gerundtotensed] 

# handle I am <adj> 
- state: q 
  lhs: (S (NP ?x0) (VP (VB am) ?x1)) 
  rhs: (S (NP ?x0) (VP (VB habe) ?x1)) 
  newstates: 
  - [[0, 0], lookup] 
  - [[1, 1], adjtonoun] 

## simple lookups for known phrases 
- state: lookup 
  lhs: (PRP I) 
  rhs: (PRP ich) 

## POS changes. 
- state: gerundtotensed 
  lhs: (VBG eating) 
  rhs: (VB esse) 

Figure 6: Sample translation rules for German 



# handle <pronoun> like <name> 
- state: q 
  lhs: (S (NP ?x0) (VP (VB like) (NP ?x1))) 
  rhs: (S (NP ?x1) (VP (NP ?x0) (VB gusta) ?x0)) 
  newstates: 
  - [[0, 0], lookup] 
  - [[1, 0, 0], objectivize] 
  - [[1, 2], tothisperson] 

- state: tothisperson 
  lhs: (PRP I) 
  rhs: (PP (A a) (PRP mi)) 

# handle usually -> soler 
- state: q 
  lhs: (S (NP ?x0) (VP (RB usually) ?x1)) 
  rhs: (S (NP ?x0) (VP (VBZ suele) ?x1)) 
  newstates: 
  - [[0, 0], lookup] 
  - [[1, 1], unconjugate] 

# handle entered-object -> entro en ... 
- state: q 
  lhs: (S (NP ?x0) (VP (VBD entered) ?x1)) 
  rhs: (S (NP ?x0) (VP (VBD entro) (PP (IN en) ?x1))) 
  newstates: 
  - [[0, 0], lookup] 
  - [[1, 1, 1], lookup] 

# handle broke-into X -> forzo la entrada a X 
- state: q 
  lhs: (S (NP ?x0) (VP (VBD broke) (PP (IN into) ?x1))) 
  rhs: (S (NP ?x0) (VP (VBD forzo) 
         (NP (DT la) (NN entrada) (PP (IN a) ?x1)))) 
  newstates: 
  - [[0, 0], lookup] 
  - [[1, 1, 2, 1], lookup] 
  - [[1, 1, 2], al] 

# handle I stabbed X -> le di puñaladas a 
- state: q 
  lhs: (S (NP ?x0) (VP (VBD stabbed) ?x1)) 
  rhs: (S (NP ?x0) (VP (PRP le) (VBD di) 
         (NP (NN puñaladas)) (NP (A a) (NNP Juan)))) 
  newstates: 
  - [[0, 0], lookup] 

Figure 7: Sample translation rules for Spanish 



system make decisions about which rules are the 
most likely to apply in a given context, either 
based on the surrounding tree material, or on the 
surface words in the source-language sentence. It 
may turn out that there is a more principled way 
to achieve the same benefits, perhaps by adding 
more conditions on the probabilities in a generative 
model. However, cross-language phrase-sense 
disambiguation with classifiers, as in the work 
of Carpuat and Wu (Carpuat and Wu, 2007), has 
proved useful for phrase-based SMT. For phrase- 
based SMT in general, discriminative approaches 
such as Minimum Error-Rate Training (MERT) 
(Och, 2003) have become quite typical. 

Another guide for the tree transduction process 
could be language models, either flat n-gram mod- 
els or structured ones, which would have the added 
benefit that they could be trained on larger corpora 
than those used to produce the tree transduction 
rules in the first place. 
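
One simple way a flat n-gram model could guide the process is n-best reranking, sketched below in Python. This is only one possible arrangement and not something the software described here implements: the function names train_bigram and rescore, the add-one smoothing, the lm_scale parameter, and the tiny German corpus and candidate weights are all illustrative assumptions.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Build an add-one-smoothed bigram model; returns a log-probability function."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {tok for s in sentences for tok in s.split()} | {"</s>"}

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            total += math.log((bigrams[(prev, cur)] + 1)
                              / (contexts[prev] + len(vocab)))
        return total

    return logprob


def rescore(candidates, lm_logprob, lm_scale=1.0):
    """Rank (text, transducer_weight) pairs by log weight plus scaled LM score."""
    scored = [(math.log(weight) + lm_scale * lm_logprob(text), text)
              for text, weight in candidates]
    return [text for _, text in sorted(scored, reverse=True)]


# Toy monolingual data and toy transducer outputs (made up for illustration).
lm = train_bigram(["ich esse gern", "ich trinke gern", "wir essen gern"])
candidates = [("ich gern esse", 0.6), ("ich esse gern", 0.4)]

print(rescore(candidates, lm))                # language model overrules the weights
print(rescore(candidates, lm, lm_scale=0.0))  # transducer weights alone
```

With the language model switched on, the word order attested in the monolingual data is preferred even though the transducer gave it the lower weight, which is the benefit of training the guide on data beyond the parallel corpus.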

8.2 Extracting and Training Transducers 

Thus far, it seems as though there is no agreed- 
upon best approach for extracting a set of tree 
transduction rules from a parallel treebank, such 
that a tree-to-tree MT system could be con- 
structed. While parallel treebanks are not abun- 
dant, with sufficiently good monolingual parsers, 
parallel trees can be created from bitext, and hope- 
fully these could be used to induce transduction 
rules for tree-to-tree MT systems. Other work 
has presented methods for learning tree-to-string 
transduction rules, for example (Galley et al., 
2004) and (DeNeefe and Knight, 2009). These 
approaches for learning tree-to-string transducers, 
if I understood them more completely, might turn 
out to generalize easily to the tree-to-tree case, but 
if so, it is not yet obvious to me how to do this. 

One proposed approach for learning relations 
over trees is given in (Eisner, 2003), in which 
Eisner presents algorithms for both extracting an 
STSG grammar and training its weights; STSGs 
can then be expressed as xT transducers as described 
by Maletti in (Maletti, 2010a). Additionally, 
approaches for learning tree transduction 
rules have been suggested for tasks other than machine 
translation, particularly in the summarization 
work of Cohn and Lapata (Cohn and Lapata, 2007; 
Cohn and Lapata, 2008), who work 
with a corpus that not only has parse trees for both 
source and target languages (in their case, pairs 



of longer and paraphrased sentences, both in En- 
glish), but has also been word-aligned. The word 
alignments inform their grammar extraction. Cohn 
and Lapata use a very small training paraphrase 
corpus (480 sentences), which suggests that perhaps 
their methods would be useful for MT with 
low-resource languages. They also use discriminative 
methods for training and decoding. 
Both their algorithm for rule extraction and the 
tree transducers with discriminative methods may 
have been used in a tree-to-tree MT system, but I 
have not yet found work that describes this; if it 
has not yet been tried, someone should explore it. 

8.3 XDG as Transducers 

Given that many grammar formalisms are express- 
ible in terms of tree transducers, one wonders 
if constraint-based dependency frameworks, such 
as Extensible Dependency Grammar (Debusmann, 
2006), which has been used by Michael Gasser 
for machine translation (Gasser, 2011), could be 
expressed in terms of tree transducers. Transduc- 
ers over dependency trees have already been used 
for machine translation, for example by Ding and 
Palmer (Ding and Palmer, 2005). However, XDG 
defines not just one layer of dependency analy- 
sis for a language, but several. Its analysis of a 
sentence in a given language is a multigraph with 
multiple dimensions of analysis, with constraints 
describing permissible structures on each dimen- 
sion, as well as the relationships between dimen- 
sions. This suggests that perhaps XDG could be 
expressed as a cascade of transducers, with each 
layer in the cascade describing the relation be- 
tween one XDG dimension and the next. 

A problem with this interpretation is that not all 
layers of an XDG multigraph are tree structures. 
This might mean that XDG cannot be cleanly ex- 
pressed in this way at all, or perhaps that another 
kind of transducer that operates on graphs more 
generally could be used. Alternatively, perhaps 
XDG could be tweaked such that every layer has a 
tree structure. 

If it is in fact possible to express XDG trans- 
lation rules as a cascade of transducers, then this 
would present a clear path for integrating ma- 
chine learning into the largely rule-based sys- 
tem, making use of the training algorithms al- 
ready present in the literature. As a fairly mod- 
est step, given small numbers of parallel training 
sentences, one could use EM to train the weights 



of the transduction rules that implement the XDG 
grammar. More ambitiously, one could perhaps 
extract grammar rules from example translation 
pairs, although the XDG parse graphs would have 
to be provided by an expert, for each layer in the 
analysis. This could be done either simply on de- 
mand, when the existing grammar fails to parse 
and translate a sentence, or using active learning 
to select sentences for human annotation. 

One problem not addressed at all in the liter- 
ature that I have seen is how to translate, either 
into or out of, morphologically rich languages us- 
ing tree transducers. It seems as though morpho- 
logical analysis and lemmatization would be an 
important first step in a transducer-based MT sys- 
tem, to limit the number of rules that the system 
needs to consider, but then the morphological in- 
formation should be used to help the system make 
choices during transduction (decoding). Perhaps 
morphological features would be useful to classi- 
fiers trained to help make syntactic disambigua- 
tion decisions. 